Abstract

Machine Learning competitions such as the Netflix Prize have proven reasonably
successful as a method of "crowdsourcing" prediction tasks. But these competitions
have a number of weaknesses, particularly in the incentive structure they create for the 
participants. We propose a new approach, called a Crowdsourced Learning Mechanism, 
in which participants collaboratively "learn" a hypothesis for a given prediction task. 
The approach draws heavily from the concept of a prediction market, where traders 
bet on the likelihood of a future event. In our framework, the mechanism continues 
to publish the current hypothesis, and participants can modify this hypothesis by 
wagering on an update. The critical incentive property is that a participant will profit 
an amount that scales according to how much her update improves performance on a 
released test set. 






1 Introduction 

The last several years have revealed a new trend in Machine Learning: prediction and learning
problems rolled into prize-driven competitions. One of the first, and certainly the most
well-known, was the Netflix prize released in the Fall of 2006. Netflix, aiming to improve the
algorithm used to predict users' preferences on its database of films, released a dataset of 
100M ratings to the public and asked competing teams to submit a list of predictions on a test 
set withheld from the public. Netflix offered $1,000,000 to the first team achieving prediction
accuracy exceeding a given threshold, a goal that was eventually met. This competitive
model for solving a prediction task has been used for a range of similar competitions since,
and there is even a new company (kaggle.com) that creates and hosts such competitions.
Such prediction competitions have proven quite valuable for a couple of important reasons: 
(a) they leverage the abilities and knowledge of the public at large, commonly known as 
"crowdsourcing" , and (b) they provide an incentivized mechanism for an individual or team 
to apply their own knowledge and techniques which could be particularly beneficial to the 
problem at hand. This type of prediction competition provides a nice tool for companies 
and institutions that need help with a given prediction task yet cannot afford to hire an
expert. The potential leverage can be quite high: the Netflix prize winners apparently spent 
more than $1,000,000 in effort on their algorithm alone. 

Despite the extent of its popularity, is the Netflix competition model the ideal way to 
"crowdsource" a learning problem? We note several weaknesses: 

It is anti-collaborative. Competitors are strongly incentivized to keep their techniques 
private. This is in stark contrast to many other projects that rely on crowdsourcing - 
Wikipedia being a prime example, where participants must build off the work of others. 
Indeed, in the case of the Netflix prize, not only do leading participants lack incentives to 
share, but the work of non-winning competitors is effectively wasted. 

The incentives are skewed and misaligned. The winner-take-all prize structure means 
that second place is as good as having not competed at all. This ultimately leads to an equi- 
librium where only a few teams are actually competing, and where potential new teams never 
form since catching up seems so unlikely. In addition, the fixed achievement benchmark, set 
by Netflix as a 10% improvement in prediction RMSE over a baseline, leads to misaligned 
incentives. Effectively, the prize structure implies that an improvement of 9.9%
is worth nothing to Netflix, whereas a 20% improvement is still only worth $1,000,000 to 
Netflix. This is clearly not optimal. 

The nature of the competition precludes the use of proprietary methods. Because
the winner is required to reveal the winning algorithm, potential competitors utilizing
non-open software or proprietary techniques will be unwilling to compete. By participating in
the competition, a user must effectively give away his intellectual property.

In this paper we describe a new and very general mechanism to crowdsource predic- 
tion/learning problems. Our mechanism requires participants to place bets, yet the space 
they are betting over is the set of hypotheses for the learning task at hand. At any given 
time the mechanism publishes the current hypothesis w and participants can wager on a 
modification of w to w', upon which the modified w' is posted. Eventually the wagering 
period finishes, a set of test data is revealed, and each participant receives a payout according 
to their bets. The critical property is that every trader's profit scales according to how well 
their modification improved the solution on the test data.

The framework we propose has many qualities similar to those of an information or
prediction market, and many of the ideas derive from recent research on the design of
automated market makers [TJ, El El SI H]. Many information markets already exist; at sites
like Intrade.com and Betfair.com individuals can bet on everything ranging from
election outcomes to geopolitical events. There has been a burst of interest in such markets
in recent years, not least of which is due to their potential for combining large amounts of 
information from a range of sources. In the words of Hanson et al [5]: "Rational expecta- 
tions theory predicts that, in equilibrium, asset prices will reflect all of the information held 
by market participants. This theorized information aggregation property of prices has lead 
economists to become increasingly interested in using securities markets to predict future 
events." In practice, prediction markets have proven impressively accurate as a forecasting
tool [LT1EJCL2].

The central contribution of the present paper is to take the framework of a prediction
market as a tool for information aggregation and to apply this tool for the purpose of
"aggregating" a hypothesis (classifier, predictor, etc.) for a given learning problem. The
crowd of ML researchers, practitioners, and domain experts represents a highly diverse range
of expertise and algorithmic tools. In contrast to the Netflix prize, which pitted teams of
participants against each other, the mechanism we propose allows everyone to contribute
whatever knowledge they may have available towards the final solution. In a sense, this
approach decentralizes the process of solving the task, as individual experts can potentially
apply their expertise to a subset of the problem on which they have an advantage. Whereas a
market price can be thought of as representing a consensus estimate of the value of an asset,
our goal is to construct a consensus hypothesis reflecting all the knowledge and capabilities
about a particular learning problem.¹

¹ It is worth noting that Barbu and Lay utilized concepts from prediction markets to design algorithms
for classifier aggregation [10], although their approach was unrelated to crowdsourcing.

Layout: We begin in Section 2.1 by introducing the simple notion of a generalized scoring
rule $L(\cdot, \cdot)$ representing the "loss function" of the learning task at hand. In Section 2.2
we describe our proposed Crowdsourced Learning Mechanism (CLM) in detail, and discuss
how to structure a CLM for a particular scoring function $L$, in order that the traders are
given incentives to minimize $L$. In Section 3 we give an example based on the design of
Huffman codes. In Section 4 we discuss previous work on the design of prediction markets
using an automated prediction market maker (APMM). We observe that any APMM is just
a particular CLM and, moreover, we fully classify what types of problems can be solved with
an APMM. In Section 5 we finish by considering two learning settings (e.g. linear regression)
and we construct a CLM for each. The proofs have been omitted throughout, but these are
available in the full version of the present paper.

Notation: Given a smooth strictly convex function $R : \mathbb{R}^d \to \mathbb{R}$, and points $x, y \in \mathrm{dom}(R)$,
we define the Bregman divergence $D_R(x, y)$ as the quantity $R(x) - R(y) - \nabla R(y) \cdot (x - y)$.
For any convex function $R$, we let $R^*$ denote the convex conjugate of $R$, that is
$R^*(y) := \sup_{x \in \mathrm{dom}(R)} y \cdot x - R(x)$. We shall use $\Delta(S)$ to refer to the set of integrable probability
distributions over the set $S$, and $\Delta_n$ to refer to the set of probability vectors $p \in \mathbb{R}^n$. The
function $H : \Delta_n \to \mathbb{R}$ shall denote the entropy function, that is $H(p) := -\sum_{i=1}^n p(i) \log p(i)$.
We use the notation $KL(p; q)$ to describe the relative entropy or Kullback-Leibler divergence
between distributions $p, q \in \Delta_n$, that is $KL(p; q) := \sum_{i=1}^n p(i) \log \frac{p(i)}{q(i)}$. We will also use
$e_i \in \mathbb{R}^n$ to denote the $i$th standard basis vector, having a 1 in the $i$th coordinate and 0's
elsewhere.
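
As a quick illustration of this notation, the following minimal sketch (plain Python; the function names are hypothetical and chosen only for this example) implements the entropy, the KL divergence and a generic Bregman divergence. It then checks numerically that choosing $R = -H$ gives $D_R(p, q) = KL(p; q)$ for two probability vectors with strictly positive entries.

```python
import math

def entropy(p):
    """H(p) = -sum_i p(i) log p(i), using the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL(p; q) = sum_i p(i) log(p(i)/q(i)); assumes q(i) > 0 wherever p(i) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bregman(R, grad_R, x, y):
    """D_R(x, y) = R(x) - R(y) - grad R(y) . (x - y)."""
    g = grad_R(y)
    return R(x) - R(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

def neg_entropy(p):
    return -entropy(p)

def grad_neg_entropy(p):
    # d/dp(i) of sum_i p(i) log p(i) is log p(i) + 1 (requires p(i) > 0)
    return [math.log(pi) + 1.0 for pi in p]

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
d = bregman(neg_entropy, grad_neg_entropy, p, q)
print(d, kl(p, q))  # the two values agree up to floating-point error
```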

2 Scoring Rules and Crowdsourced Learning Mechanisms

We shall now provide a full description of our proposed crowdsourced learning mechanism. 
We begin by discussing the notion of a scoring rule, a well-studied object from statistics for 
the purpose of eliciting "good" probability forecasts [6]. We propose a weaker notion which
we call a generalized scoring rule $L(\cdot, \cdot)$ which shall reflect the loss function of the learning
problem at hand. We then proceed to describe the CLM framework, and in particular we 
present the important case when a CLM implements a generalized scoring rule L. We provide 
a range of properties and results for L-CLMs. 

2.1 Generalized Scoring Rules 

For the remainder of this section, we shall let $\mathcal{H}$ denote some set of hypotheses, which we
will assume is a convex subset of $\mathbb{R}^n$. We let $\mathcal{O}$ be some arbitrary set of outcomes. We use
the symbol $X$ to refer to either an element of $\mathcal{O}$, or a random variable taking values in $\mathcal{O}$.

We recall the notion of a scoring rule, a concept that arises frequently in economics and
statistics [6].

Definition 1. Let $\mathcal{P} \subset \Delta(\mathcal{O})$ be some convex set of distributions on an outcome space $\mathcal{O}$. A
scoring rule is a function $S : \mathcal{P} \times \mathcal{O} \to \mathbb{R}$ where, for all $P \in \mathcal{P}$,
$P \in \arg\max_{Q \in \mathcal{P}} \mathbb{E}_{X \sim P}\, S(Q, X)$.

In other words, if you are paid $S(P, X)$ when you state belief $P \in \mathcal{P}$ and outcome $X$
occurs, then you maximize your expected utility by stating your true belief. We offer a
much weaker notion:

Definition 2. Given a convex hypothesis space $\mathcal{H} \subset \mathbb{R}^n$ and an outcome space $\mathcal{O}$, let
$L : \mathcal{H} \times \mathcal{O} \to \mathbb{R}$ be a continuous function. Given any $P \in \Delta(\mathcal{O})$, let

\[ W_L(P) := \arg\min_{w \in \mathcal{H}} \mathbb{E}_{X \sim P}[L(w; X)]. \]

Then we say that $L$ is a Generalized Scoring Rule (GSR) if $W_L(P)$ is a nonempty convex
set for every $P \in \Delta(\mathcal{O})$.

The generalized scoring rule shall represent the "loss function" for the learning problem
at hand, and in Section 2.2 we will see how $L$ is utilized in the mechanism. The hypothesis
$w$ shall represent the advice we receive from the crowd, $X$ shall represent the test data to be
revealed at the close of the mechanism, and $L(w; X)$ shall represent the loss of the advised
$w$ on the data $X$. Notice that we do not define $L$ to be convex in its first argument as
this does not hold for many important cases. Instead, we require the weaker condition that
$\mathbb{E}_X[L(w; X)]$ is minimized on a convex set for any distribution on $X$.

Our scoring rule differs from traditional scoring rules in an important way. Instead of
starting with the desire to know the true value of $X$, and then designing a scoring rule
which incentivizes participants to reveal their belief $P \in \mathcal{P}$, our objective is precisely to
minimize our scoring rule. In other words, traditional scoring rules were a means to an end
(eliciting $P$) but our generalized scoring rule is the end itself. One can recover the traditional
scoring rule definition by setting $\mathcal{H} = \mathcal{P}$ and imposing the constraint that $P \in W_L(P)$.

A useful class of GSRs L are those based on a Bregman divergence. 

Definition 3. We say that a GSR $L : \mathcal{H} \times \mathcal{O} \to \mathbb{R}$ is divergence-based if there exists an
alternative hypothesis space $\mathcal{H}' \subset \mathbb{R}^m$, for some $m$, where we can write

\[ L(w; X) = D_R(\rho(X), \psi(w)) + f(X) \tag{1} \]

for arbitrary maps $\rho : \mathcal{O} \to \mathcal{H}'$, $f : \mathcal{O} \to \mathbb{R}$, and $\psi : \mathcal{H} \to \mathcal{H}'$, and any closed strictly convex
$R : \mathcal{H}' \to \mathbb{R}$ whose convex conjugate $R^*$ is finite on all of $\mathbb{R}^m$.

This property allows us to think of $L(w; X)$ as a kind of distance between $\rho(X)$ and $\psi(w)$.
Clearly then, the minimum value of $L$ for a given $X$ will be attained when $\psi(w) = \rho(X)$,
given that $D_R(x, x) = 0$ for any Bregman divergence. In fact, as the following proposition
shows, we can even think of the expected value $\mathbb{E}[L(w; X)]$ as a distance between $\mathbb{E}[\rho(X)]$
and $\psi(w)$.

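Proposition 1. Given a divergence-based GSR $L(w; X) = D_R(\rho(X), \psi(w)) + f(X)$ and a
belief distribution $P$ on $\mathcal{O}$, we have $W_L(P) = \psi^{-1}(\mathbb{E}_{X \sim P}[\rho(X)])$.

Proof. All expectations in the following are over $X \sim P$. Expanding $L$, we have

\[
\begin{aligned}
W_L(P) &= \arg\min_{w \in \mathcal{H}} \Big\{ \mathbb{E}\big[ R(\rho(X)) - R(\psi(w)) - \nabla R(\psi(w)) \cdot (\rho(X) - \psi(w)) + f(X) \big] \Big\} \\
&= \arg\min_{w \in \mathcal{H}} \Big\{ \mathbb{E}\big[ R(\rho(X)) + f(X) \big] - R(\psi(w)) - \nabla R(\psi(w)) \cdot (\mathbb{E}[\rho(X)] - \psi(w)) \Big\} \\
&= \arg\min_{w \in \mathcal{H}} \Big\{ R(\mathbb{E}[\rho(X)]) - R(\psi(w)) - \nabla R(\psi(w)) \cdot (\mathbb{E}[\rho(X)] - \psi(w)) \Big\} \\
&= \arg\min_{w \in \mathcal{H}} \Big\{ D_R(\mathbb{E}[\rho(X)], \psi(w)) \Big\} = \psi^{-1}(\mathbb{E}[\rho(X)]),
\end{aligned}
\]

where the third line replaces the term $\mathbb{E}[R(\rho(X)) + f(X)]$, which does not depend on $w$, by
the constant $R(\mathbb{E}[\rho(X)])$, and the last line follows from the strict convexity of $R$ and
properties of divergences. □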
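
A small numerical check can make Proposition 1 concrete. The sketch below assumes, purely for illustration, the one-dimensional squared loss: $R(x) = x^2/2$ (so $D_R(x, y) = (x - y)^2/2$), $\rho(X) = X$, $\psi$ the identity and $f \equiv 0$. Under these assumptions the proposition says the expected loss is minimized at the mean of $X$, which a grid search confirms. The outcome values and probabilities are made up for the example.

```python
# Illustrative belief P over three real-valued outcomes (assumed values)
outcomes = [0.0, 1.0, 4.0]
probs = [0.5, 0.25, 0.25]

def loss(w, x):
    # L(w; X) = D_R(X, w) with R(x) = x^2 / 2, i.e. half the squared error
    return 0.5 * (x - w) ** 2

def expected_loss(w):
    return sum(p * loss(w, x) for p, x in zip(probs, outcomes))

# Proposition 1 predicts the minimizer psi^{-1}(E[rho(X)]) = E[X]
mean = sum(p * x for p, x in zip(probs, outcomes))

# Brute-force search over a grid of candidate hypotheses in [0, 4]
grid = [i / 100 for i in range(401)]
best = min(grid, key=expected_loss)
print(mean, best)
```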

We now can see that the divergence-based property greatly simplifies the task of
minimizing $L$; instead of worrying about $\mathbb{E}[L(\cdot; X)]$ one can simply base the hypothesis directly
on the expectation $\mathbb{E}[\rho(X)]$. As we will see in Section 4, this also leads to efficient prediction
markets and crowdsourcing mechanisms.

2.2 The Crowdsourced Learning Mechanism

We will now define our actual mechanism rigorously.

Definition 4. A Crowdsourced Learning Mechanism (CLM) is the procedure in Algorithm 1
as defined by the tuple $(\mathcal{H}, \mathcal{O}, \mathrm{Cost}, \mathrm{Payout})$. The function $\mathrm{Cost} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ sets the cost
charged to a participant that makes a modification to the posted hypothesis. The function
$\mathrm{Payout} : \mathcal{H} \times \mathcal{H} \times \mathcal{O} \to \mathbb{R}$ determines the amount paid to each participant when the outcome
is revealed to be $X$.

Algorithm 1 Crowdsourced Learning Mechanism for $(\mathcal{H}, \mathcal{O}, \mathrm{Cost}, \mathrm{Payout})$
1: Mechanism sets initial hypothesis to some $w_0 \in \mathcal{H}$
2: for rounds $t = 0, 1, 2, \ldots$ do
3:   Mechanism posts current hypothesis $w_t \in \mathcal{H}$
4:   Some participant places a bid on the update $w_t \mapsto w'$
5:   Mechanism charges participant $\mathrm{Cost}(w_t, w')$
6:   Mechanism updates hypothesis $w_{t+1} \leftarrow w'$
7: end for
8: Market closes after $T$ rounds and the outcome (test data) $X \in \mathcal{O}$ is revealed
9: for each $t$ do
10:   Participant responsible for the update $w_t \mapsto w_{t+1}$ receives $\mathrm{Payout}(w_t, w_{t+1}; X)$
11: end for



The above procedure describes the process by which participants can provide advice to 
the mechanism to select a good w, and the profit they earn by doing so. Of course, this 
profit will precisely determine the incentives of our mechanism, and hence a key question is: 
how can we design Cost and Payout so that participants are incentivized to provide good 
hypotheses? The answer is that we shall structure the incentives around a GSR L(w; X) 
chosen by the mechanism designer. 

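Definition 5. For a CLM $A = (\mathcal{H}, \mathcal{O}, \mathrm{Cost}, \mathrm{Payout})$, denote the ex-post profit for the
bid $(w \mapsto w')$ when the outcome is $X \in \mathcal{O}$ by
$\mathrm{Profit}(w, w'; X) := \mathrm{Payout}(w, w'; X) - \mathrm{Cost}(w, w')$.
We say that $A$ implements a GSR $L : \mathcal{H}' \times \mathcal{O} \to \mathbb{R}$ if there exists a surjective
map $\varphi : \mathcal{H} \to \mathcal{H}'$ such that for all $w_1, w_2 \in \mathcal{H}$ and $X \in \mathcal{O}$,

\[ \mathrm{Profit}(w_1, w_2; X) = L(\varphi(w_1); X) - L(\varphi(w_2); X). \tag{2} \]

If additionally $\mathcal{H}' = \mathcal{H}$ and $\varphi = \mathrm{id}_{\mathcal{H}}$, we call $A$ an L-CLM and say that $A$ is L-incentivized.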

When a CLM implements a given L, the incentives are structured in order that the 
participants will work to minimize L(w;X). Of course, the input X is unknown to the 
participants, yet we can assume that the mechanism has provided a public "training set" to 
use in a learning algorithm. The participants are thus asked not only to propose a "good"
hypothesis $w_t$ but to wager on whether the update $w_{t-1} \mapsto w_t$ improves generalization
error. It is worth making clear that knowledge of the true distribution on $X$ provides a
straightforward optimal strategy.

Proposition 2. Given a GSR $L : \mathcal{H} \times \mathcal{O} \to \mathbb{R}$ and an L-CLM (Cost, Payout), any
participant who knows the true distribution $P \in \mathcal{P}$ over $X$ will maximize expected profit by
modifying the hypothesis to any $w \in W_L(P)$.

Proof. By equation (2) we can directly compute the expected profit; for any current
hypothesis $w \in \mathcal{H}$, we have

\[
\begin{aligned}
\arg\max_{w' \in \mathcal{H}} \mathbb{E}_{X \sim P}[\mathrm{Profit}(w, w'; X)] &= \arg\max_{w' \in \mathcal{H}} \mathbb{E}_{X \sim P}[L(w; X) - L(w'; X)] \\
&= \arg\min_{w' \in \mathcal{H}} \mathbb{E}_{X \sim P}[L(w'; X)] = W_L(P),
\end{aligned}
\]

which completes the proof. □

Cost of operating a CLM. It is clear that the agent operating the mechanism must pay
the participants at the close of the competition, and is thus at risk of losing money (in fact,
it is possible he may gain). How much money is lost depends on the bets $(w_t \mapsto w_{t+1})$ made
by the participants, and of course the final outcome $X$. The agent has a clear interest in
knowing precisely the potential cost; fortunately this cost is easy to compute. The loss to
the agent is clearly the total ex-post profit earned by the participants, and by construction
this sum telescopes: $\sum_{t=0}^{T-1} \mathrm{Profit}(w_t, w_{t+1}; X) = L(w_0; X) - L(w_T; X)$. This is a simple yet
appealing property of the CLM: the agent pays only as much in reward to the participants
as it benefits from the improvement of $w_T$ over the initial $w_0$. It is worth noting that this
value could be negative when $w_T$ is actually "worse" than $w_0$; in this case, as we shall
see in Section 3, the CLM can act as an insurance policy with respect to the mistakes
of the participants. A more typical scenario, of course, is where the participants provide
an improved hypothesis, in which case the CLM will run at a cost. We can compute the
worst-case loss $\mathrm{WorstCaseLoss}(L\text{-CLM}) := \max_{w \in \mathcal{H}, X \in \mathcal{O}} (L(w_0; X) - L(w; X))$. Given a budget of size \$B,
the mechanism can always rescale $L$ in order that $\mathrm{WorstCaseLoss}(L\text{-CLM}) = B$. This
requires, of course, that the WorstCaseLoss is finite.
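
The procedure of Algorithm 1 and the telescoping property above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the class name `CLM` and its methods `bid` and `close` are hypothetical, the hypothesis is a single real number, the loss is the squared error, and Cost is set to zero (so no collateral is modelled here).

```python
class CLM:
    """Toy Crowdsourced Learning Mechanism (hypothetical interface)."""

    def __init__(self, w0, cost, payout):
        self.cost = cost
        self.payout = payout
        self.current = w0   # posted hypothesis w_t
        self.bets = []      # (participant, w_t, w_{t+1}, amount charged)

    def bid(self, participant, w_new):
        # Charge Cost(w_t, w') and move the posted hypothesis to w'
        charged = self.cost(self.current, w_new)
        self.bets.append((participant, self.current, w_new, charged))
        self.current = w_new
        return charged

    def close(self, X):
        # Reveal the outcome X and return the ex-post profit per participant
        profits = {}
        for who, w_old, w_new, charged in self.bets:
            profit = self.payout(w_old, w_new, X) - charged
            profits[who] = profits.get(who, 0.0) + profit
        return profits

def L(w, X):
    return (w - X) ** 2          # assumed loss: squared error

def cost(w, w_new):
    return 0.0                   # assumption: no collateral in this toy example

def payout(w, w_new, X):
    # An L-CLM fixes Payout once Cost is chosen: Payout = L(w) - L(w') + Cost
    return L(w, X) - L(w_new, X) + cost(w, w_new)

market = CLM(0.0, cost, payout)
market.bid("alice", 2.0)
market.bid("bob", 3.5)
profits = market.close(3.0)
print(profits)                                  # per-participant profit
print(sum(profits.values()), L(0.0, 3.0) - L(3.5, 3.0))  # totals coincide
```

In this run the operator's total outlay equals $L(w_0; X) - L(w_T; X)$, regardless of how the improvement was split between the two bidders.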

Computational efficiency of operating a CLM. We shall say that a CLM has the 
efficient computation (EC) property if both Cost and Payout are efficiently computable 
functions. We shall say a CLM has the tractable trading (TT) property if, given a current
hypothesis $w$, a belief $P \in \Delta(\mathcal{O})$ and a budget $B$, one can efficiently compute an element of
the set

\[ \arg\max_{w' \in \mathcal{H}} \big\{ \mathbb{E}_{X \sim P}[\mathrm{Profit}(w, w'; X)] : \mathrm{Cost}(w, w') \le B \big\}. \]

The EC property ensures that the mechanism operator can run the CLM efficiently. The
TT property says that participants can compute the optimal hypothesis to bet on given a
belief on the outcome and a budget. This is absolutely essential for the CLM to successfully
aggregate the knowledge and expertise of the crowd: without it, despite their motivation
to lower $L(\cdot; \cdot)$, the participants would not be able to compute the optimal bet.

Suitable collateral requirements. We say that a CLM has the escrow (ES) property if
the Cost and Payout functions are structured in order that, given any wager $(w \mapsto w')$, we
have that $\mathrm{Payout}(w, w'; X) \ge 0$ for all $X \in \mathcal{O}$. It is clear that, when designing an L-CLM
for a particular $L$, the Payout function is fully specified once Cost is fixed, since we have
the relation $\mathrm{Payout}(w, w'; X) = L(w; X) - L(w'; X) + \mathrm{Cost}(w, w')$ for every $w, w' \in \mathcal{H}$ and
$X \in \mathcal{O}$. A curious reader might ask, why not simply set $\mathrm{Cost}(w, w') = 0$ and Payout =
Profit? The problem with this approach is that potentially $\mathrm{Payout}(w, w'; X) < 0$, which
implies that the participant who wagered on $(w \mapsto w')$ can be indebted to the mechanism
and could default on this obligation. Thus the Cost function should be set in order to require
every participant to deposit at least enough collateral in escrow to cover any possible losses.
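
One simple way to obtain the escrow property, sketched below, is to charge the worst-case loss of the wager over all outcomes as collateral. This is an illustrative choice rather than a construction prescribed by the mechanism: it assumes a finite outcome set, and the helper names `escrow_cost` and `payout` as well as the clipping of the cost at zero are assumptions of this example.

```python
def escrow_cost(L, outcomes, w, w_new):
    # Largest possible ex-post loss of the wager (w -> w_new), never negative
    worst = max(L(w_new, X) - L(w, X) for X in outcomes)
    return max(0.0, worst)

def payout(L, outcomes, w, w_new, X):
    # For an L-CLM, Payout = L(w; X) - L(w'; X) + Cost(w, w')
    return L(w, X) - L(w_new, X) + escrow_cost(L, outcomes, w, w_new)

def squared_loss(w, X):
    return (w - X) ** 2

outcomes = [0.0, 1.0, 2.0, 3.0]   # assumed finite outcome space
w, w_new = 1.0, 2.5
collateral = escrow_cost(squared_loss, outcomes, w, w_new)
payouts = [payout(squared_loss, outcomes, w, w_new, X) for X in outcomes]
print(collateral, payouts)  # every payout is nonnegative
```

With this choice the participant can lose at most the deposited collateral, so no debt to the mechanism can arise.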

Subsidizing with a voucher pool. One practical weakness of a wagering-based mech- 
anism is that individuals may be hesitant to participate when it requires depositing actual 
money into the system. This can be allayed to a reasonable degree by including a voucher 
pool where each of the first m participants may receive a voucher in the amount of $C. These 
candidates need not pay to participate, yet have the opportunity to win. Of course, these 
vouchers must be paid for by the agent running the mechanism, and hence a value of mC is 
added to the total operational cost. 

3 Warm-up: Compressing an Unfamiliar Data Stream 

Let us now introduce a particular setting motivated by a well-known problem in information 
theory. Imagine a firm is looking to do compression on an unfamiliar channel, and from this 
channel the firm will receive a stream of m characters from an n-sized alphabet which we 
shall index by [n]. The goal is to select a binary encoding of this alphabet in a way that
minimizes the total bits required to store the data, as a cost of $1 is required for each bit.

A first-order approach to encode such a stream is to assign a probability distribution
$q \in \Delta_n$ to the alphabet, and to select an encoding of character $i$ with a binary word of
length $\log(1/q(i))$ (we ignore round-off for simplicity). This can be achieved using Huffman
Codes for example, and we refer the reader to Cover and Thomas ([5], Chapter 5) for more
details. Thus, given a distribution $q$, the firm pays $L(q; i) = -\log q(i)$ for each character
$i$. It is easy to see that if the characters are sampled from some "true" distribution $p$, then
the expected cost is $L(q; p) := \mathbb{E}_{i \sim p}[L(q; i)] = KL(p; q) + H(p)$, which is minimized at $q = p$.
Not knowing the true distribution $p$, the firm is thus interested in finding a $q$ with a low
expected cost $L(q; p)$.
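
The identity $L(q; p) = KL(p; q) + H(p)$ is easy to check numerically. The sketch below measures code lengths in bits, so it assumes base-2 logarithms throughout; the distribution `p` and the helper names are made up for the example, and round-off of code lengths is ignored as in the text.

```python
import math

def code_length_bits(q, i):
    # Idealized code length for character i under distribution q
    return -math.log2(q[i])

def expected_bits(q, p):
    # L(q; p): expected bits per character when characters follow p
    return sum(pi * code_length_bits(q, i) for i, pi in enumerate(p) if pi > 0)

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # assumed "true" distribution
q_uniform = [0.25] * 4          # the firm's initial guess

print(expected_bits(q_uniform, p))                 # cost under the uniform guess
print(kl_bits(p, q_uniform) + entropy_bits(p))     # same value via KL + H
print(expected_bits(p, p), entropy_bits(p))        # minimum cost is H(p)
```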

An attractive option available to the firm is to crowdsource the task of lowering this cost 
$L(\cdot; \cdot)$ by setting up an L-CLM. It is reasonably likely that outside individuals have private
information about the behavior of the channel and, in particular, may be able to provide a 
better estimate q of the true distribution of the characters in the channel. As just discussed, 
the better the estimate the cheaper the compression. 

We set $\mathcal{H} = \Delta_n$ and $\mathcal{O} = [n]$, where a hypothesis $q$ represents the proposed distribution
over the $n$ characters, and $X$ is some character sampled uniformly from the stream after it
has been observed. We define Cost and Payout as

\[ \mathrm{Cost}(q, q') := \max_{i \in [n]} \log(q(i)/q'(i)), \qquad \mathrm{Payout}(q, q'; i) := \log(q'(i)/q(i)) + \mathrm{Cost}(q, q'), \]

which is clearly an L-CLM for the loss defined above, since
$\mathrm{Payout}(q, q'; i) - \mathrm{Cost}(q, q') = \log q'(i) - \log q(i) = L(q; i) - L(q'; i)$.
It is worth noting that $L$ is a divergence-based GSR if we take
$R(q) = -H(q)$, $\rho(i) = e_i$, $f \equiv 0$, $\psi = \mathrm{id}_{\Delta_n}$, using the convention
$0 \log 0 = 0$ (in fact, $L$ is the LMSR). Finally, the firm will initially set $q_0$ to be its best guess
of $p$, which we will assume to be uniform (but need not be).
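
A minimal Python sketch of this Cost and Payout pair follows. The payout is written so that Payout minus Cost equals $L(q; i) - L(q'; i)$, as equation (2) requires of an L-CLM with $L(q; i) = -\log q(i)$. The function names and the two example distributions are assumptions of the illustration, and both distributions are taken to have strictly positive entries.

```python
import math

def loss(q, i):
    return -math.log(q[i])            # L(q; i)

def cost(q, q_new):
    # Collateral: the largest amount the bidder could lose over all characters
    return max(math.log(a / b) for a, b in zip(q, q_new))

def payout(q, q_new, i):
    # Chosen so that payout - cost = L(q; i) - L(q_new; i)
    return math.log(q_new[i] / q[i]) + cost(q, q_new)

q0 = [0.25, 0.25, 0.25, 0.25]         # posted hypothesis (uniform)
q1 = [0.4, 0.3, 0.2, 0.1]             # a participant's proposed update

c = cost(q0, q1)
for i in range(4):
    profit = payout(q0, q1, i) - c
    print(i, round(payout(q0, q1, i), 4), round(profit, 4))
```

The payout is zero for the character on which the update lowered the probability the most, and positive for every other character, so the bidder never owes the mechanism more than the deposited collateral.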

We have devised this payout scheme according to the selection of a single character $i$,
and it is worth noting that because this character is sampled uniformly at random from
the stream (with private randomness), the participants cannot know which character will
be released. This forces the participants to wager on the empirical distribution $p$ of the
characters from the stream. A reasonable alternative, and one which lowers the payment
variance, is to pay out according to $L(q; p)$, which is also equal to the average of $L(q; i)$
when $i$ is chosen uniformly from the stream.

The obvious question to ask is: how does this CLM benefit the firm that wants to design
the encoding? More precisely, if the firm uses the final estimate $q_T$ from the mechanism,
instead of the initial guess $q_0$, what is the trade-off between the money paid to participants
and the money gained by using the crowdsourced hypothesis? At first glance, it appears
that this trade-off can be arbitrarily bad: the worst case cost of encoding the stream using
the final estimate $q_T$ is $\sup_{i, q_T} -\log(q_T(i)) = \infty$. Amazingly, however, by virtue of the
aligned incentives, the firm has very strong control of its total cost (the CLM cost plus
the encoding cost). Suppose the firm scales $L$ by a parameter $\alpha$, to separate the scale of the
CLM from the scale of the encoding cost (which we assumed to be \$1 per bit). Then given
any initial estimate $q_0$ and final estimate $q_T$, the expected total cost over $p$ is

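\[
\begin{aligned}
\text{Total expected cost} &= \underbrace{H(p) + KL(p; q_T)}_{\text{Encoding cost of using } q_T \text{ given } p} + \underbrace{\alpha \big( KL(p; q_0) - KL(p; q_T) \big)}_{\text{Mechanism's cost of getting advice}} \\
&= H(p) + (1 - \alpha)\, KL(p; q_T) + \alpha\, KL(p; q_0)
\end{aligned}
\]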

Let us spend a moment to analyze the above expression. Imagine that the firm set $\alpha = 1$.
Then the total cost of the firm would be $H(p) + KL(p; q_0)$, which is bounded by $\log n$ for
$q_0$ uniform. Notice that this expression does not depend on $q_T$; in fact, this cost precisely
corresponds to the scenario where the firm had not set up a CLM and instead used the initial
estimate $q_0$ to encode. In other words, for $\alpha = 1$, the firm is entirely neutral to the quality of
the estimate $q_T$; even if the CLM provided an estimate $q_T$ which performed worse than $q_0$,
the cost increase due to the bad choice of $q_T$ is recouped from payments of the ill-informed
participants.

The firm may not want to be neutral to the estimate of the crowd, however, and under
the reasonable assumption that the final estimate $q_T$ will improve upon $q_0$, the firm should
set $0 < \alpha < 1$ (of course, positivity is needed for nonzero payouts). In this case, the firm will
strictly gain by using the CLM when $KL(p; q_T) < KL(p; q_0)$, but still has some insurance
policy if the estimate $q_T$ is poor.
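
The effect of the scale parameter can be seen in a short numerical sketch. The distributions below are invented for illustration, natural logarithms are used, and `total_expected_cost` is a hypothetical helper that evaluates $H(p) + KL(p; q_T) + \alpha (KL(p; q_0) - KL(p; q_T))$.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_expected_cost(p, q0, qT, alpha):
    # Encoding cost with q_T plus the alpha-scaled cost of running the CLM
    return entropy(p) + kl(p, qT) + alpha * (kl(p, q0) - kl(p, qT))

p = [0.7, 0.2, 0.1]                 # assumed true distribution
q0 = [1 / 3, 1 / 3, 1 / 3]          # uniform initial guess
good = [0.6, 0.3, 0.1]              # a final estimate that improves on q0
bad = [0.1, 0.1, 0.8]               # a final estimate that is worse than q0

for alpha in (1.0, 0.5):
    print(alpha,
          round(total_expected_cost(p, q0, good, alpha), 4),
          round(total_expected_cost(p, q0, bad, alpha), 4))
print(round(math.log(3), 4))        # cost of encoding with q0 and no CLM
```

With `alpha = 1.0` both totals equal $\log 3$, the cost of using the uniform guess without any mechanism. With `alpha = 0.5` the good estimate lowers the total cost, while the bad estimate raises it by only half of what encoding with it unprotected would cost.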
4 Prediction Markets as a Special Case



Let us briefly review the literature for the type of prediction markets relevant to the present 
work. In such a prediction market, we imagine a future event to reveal one of n uncertain 
outcomes. Hanson [7, 8] proposed a framework in which traders make "reports" to the market
about their internal belief in the form of a distribution $p \in \Delta_n$. Each trader would receive
a reward (or loss) based on a function of their proposed belief and the belief of the previous 
trader, and the function suggested by Hanson was the Logarithmic Market Scoring Rule 
(LMSR). It was shown later that the LMSR-based market is equivalent to what is known as
a cost-function-based automated market maker, proposed by Chen and Pennock [3]. More
recently a much broader equivalence was established by Chen and Wortman Vaughan jl] 
between markets based on cost functions and those based on scoring rules. 

The market framework proposed by Chen and Pennock allows traders to buy and sell 
Arrow-Debreu securities (equivalently: shares, contracts), where an Arrow-Debreu security 
corresponding to outcome i pays out $1 if and only if i is realized. All shares are bought 
and sold through an automated market maker, which is the entity managing the market and 
setting prices. At any time period, traders can purchase bundles of contracts $r \in \mathbb{R}^n$, where
$r(i)$ represents the number of shares purchased on outcome $i$. The price of a bundle $r$ is
set as $C(s + r) - C(s)$, where $C$ is some differentiable convex cost function and $s \in \mathbb{R}^n$ is
the "quantity vector" representing the total number of outstanding shares. The LMSR cost 
function is $C(s) := \frac{1}{\eta} \log \left( \sum_{i=1}^{n} \exp(\eta\, s(i)) \right)$.
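
To make the cost-function pricing concrete, here is a minimal sketch of an LMSR market maker, taking the LMSR cost in its standard log-sum-exp form with a positive scale parameter written here as `eta` (the parameter name and the helper functions are assumptions of this example). The instantaneous prices are the gradient of $C$, and the price of a bundle $r$ is $C(s + r) - C(s)$.

```python
import math

def lmsr_cost(s, eta=1.0):
    # C(s) = (1/eta) * log(sum_i exp(eta * s(i))), shifted for numerical stability
    m = max(eta * si for si in s)
    return (m + math.log(sum(math.exp(eta * si - m) for si in s))) / eta

def lmsr_prices(s, eta=1.0):
    # Gradient of C: a probability vector proportional to exp(eta * s(i))
    m = max(eta * si for si in s)
    weights = [math.exp(eta * si - m) for si in s]
    total = sum(weights)
    return [w / total for w in weights]

def bundle_price(s, r, eta=1.0):
    # Amount charged for buying bundle r when the quantity vector is s
    s_after = [si + ri for si, ri in zip(s, r)]
    return lmsr_cost(s_after, eta) - lmsr_cost(s, eta)

s = [0.0, 0.0, 0.0]                  # no outstanding shares
print(lmsr_prices(s))                # uniform prices
print(bundle_price(s, [1.0, 0.0, 0.0]))   # cost of one share on outcome 0
print(lmsr_prices([1.0, 0.0, 0.0]))  # price of outcome 0 rises after the purchase
```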

This cost function framework was extended by Abernethy et al. [1] to deal with
prohibitively large outcome spaces. When the set of potential outcomes $\mathcal{O}$ is of exponential
size or even infinite, the market designer can offer a restricted number of contracts, say $n$
($\ll |\mathcal{O}|$), rather than offer an Arrow-Debreu contract for each member of $\mathcal{O}$. To determine
the payout structure, the market designer chooses a function $\rho : \mathcal{O} \to \mathbb{R}^n$, where contract $i$
returns a payout of $\rho_i(X)$ and, thus, a contract bundle $r$ pays $\rho(X) \cdot r$. As with the
framework of Chen and Pennock, the contract prices are set according to a cost function $C$, so
that a bundle $r$ has a price of $C(s + r) - C(s)$. The design of the function $C$ is addressed at
length in Abernethy et al., to which we refer the reader.

For the remainder of this section we shall discuss the prediction market template of 
Abernethy et al. as it provides the most general model; we shall refer to such a market 
as an Automated Prediction Market Maker. We now precisely state the ingredients of this 
framework. 

Definition 6. An Automated Prediction Market Maker (APMM) is defined by a tuple
$(\mathcal{S}, \mathcal{O}, \rho, C)$ where $\mathcal{S}$ is the share space of the market, which we will assume to be the linear
space $\mathbb{R}^n$; $\mathcal{O}$ is the set of outcomes; $C : \mathcal{S} \to \mathbb{R}$ is a smooth and convex cost function with
$\nabla C(\mathcal{S}) = \mathrm{relint}(\nabla C(\mathcal{S}))$ (here, we use $\nabla C(\mathcal{S}) := \{\nabla C(s) \mid s \in \mathcal{S}\}$ to denote the derivative
space of $C$); and $\rho : \mathcal{O} \to \nabla C(\mathcal{S})$ is a payoff function.

Fortunately, we need not provide a full description of the procedure of the APMM
mechanism: the APMM is precisely a special case of a CLM! Indeed, the APMM framework can
be described as a CLM $(\mathcal{H}, \mathcal{O}, \mathrm{Cost}, \mathrm{Payout})$ where

\[ \mathcal{H} = \mathcal{S}\ (= \mathbb{R}^n), \qquad \mathrm{Cost}(s, s') = C(s') - C(s), \qquad \mathrm{Payout}(s, s'; X) = \rho(X) \cdot (s' - s). \tag{3} \]

Hence we can think of APMM prediction markets in terms of our learning mechanism. 
Markets of this form are an important special class of CLMs - in particular, we can guarantee 
that they are efficient to work with, as we show in the following proposition. 

Proposition 3. An APMM $(\mathcal{S}, \mathcal{O}, \rho, C)$ with an efficiently computable $C$ satisfies the EC and
TT properties.

Proof. Computing Cost and Payout, defined in (3), requires simply evaluating $C(\cdot)$ at two
inputs and evaluating $\rho(\cdot)$ and taking a dot product, all of which are polynomial-in-$n$
operations. Hence an APMM satisfies the EC property. Furthermore, for a participant to make
an optimal trade with belief $P$ under a budget constraint $B$, she must simply compute, for
any fixed $s'$,

\[ \arg\max_{s \in \mathbb{R}^n :\, C(s) - C(s') \le B} \; \mathbb{E}_{X \sim P}[\rho(X)] \cdot (s - s') - C(s) + C(s'). \]

The latter objective is a standard convex optimization problem in $n$ parameters which can
be solved efficiently. □

We now ask, what is the learning problem that the participants of an APMM are trying 
to solve? More precisely, when we think of an APMM as a CLM, does it implement a 
particular L? 

Lemma 1. Let APMM $A = (\mathcal{S}, \mathcal{O}, \rho, C)$ be given. Then $A$ implements the GSR
$L : \nabla C(\mathcal{S}) \times \mathcal{O} \to \mathbb{R}$ defined by

\[ L(w; X) = D_{C^*}(\rho(X), w) + f(X), \tag{4} \]

where $C^*$ is the conjugate dual of the function $C$ and $f$ is arbitrary.

Proof. We analyze the profits for trades in $A$. Let $p_t = \nabla C(s_t)$ be the instantaneous prices
at time $t$. The ex-post profit of the bet $(s_t \mapsto s_{t+1})$ is then

\[
\begin{aligned}
\mathrm{Profit}(s_t, s_{t+1}; X) &= (s_{t+1} - s_t) \cdot \rho(X) - C(s_{t+1}) + C(s_t) \\
&= (s_{t+1} - s_t) \cdot \rho(X) - \big( p_{t+1} \cdot s_{t+1} - C^*(p_{t+1}) \big) + \big( p_t \cdot s_t - C^*(p_t) \big) \\
&= C^*(p_{t+1}) + s_{t+1} \cdot (\rho(X) - p_{t+1}) - C^*(p_t) - s_t \cdot (\rho(X) - p_t).
\end{aligned} \tag{5}
\]

Now note that since $C$ is closed and convex, by duality we can write

\[ C(s) = \sup_{p \in \nabla C(\mathcal{S})} \{ s \cdot p - C^*(p) \}, \tag{6} \]

for all $s$. Using standard techniques in conjugate duality theory, we can conclude that the
sup is achieved for $p = \nabla C(s)$; we briefly sketch this argument here. Since $C^*$ is the
conjugate dual of $C$, we have $C^*(\nabla C(s)) = \sup_{s' \in \mathbb{R}^n} s' \cdot \nabla C(s) - C(s')$. As this objective is
unconstrained and concave in $s'$, and its gradient $\nabla C(s) - \nabla C(s')$ vanishes at $s' = s$, we see
that an optimal choice of $s'$ is identically $s$. This gives the following equality,

\[ C^*(\nabla C(s)) = s \cdot \nabla C(s) - C(s). \tag{7} \]

Of course, by reconciling equations (6) and (7), we see that one optimal choice of $p$ is $\nabla C(s)$;
indeed this is the only choice, although we need not prove this statement.

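Continuing, since this $p := \nabla C(s)$ maximizes the objective in (6), we must have

\[ 0 = v \cdot \nabla_p \big( s \cdot p - C^*(p) \big) = v \cdot (s - \nabla C^*(p)), \]

for any direction $v \in \nabla C(\mathcal{S}) - \{p\}$. This is because $p \in \nabla C(\mathcal{S}) = \mathrm{relint}(\nabla C(\mathcal{S}))$ by
assumption, so if the directional derivative were nonzero the objective would increase in
the direction of $v$ or $-v$. Now since $\rho(\mathcal{O}) \subseteq \nabla C(\mathcal{S})$ according to Definition 6, we have
$\rho(X) - p \in \nabla C(\mathcal{S}) - \{p\}$ for all $p$. Thus,

\[
\begin{aligned}
s_{t+1} \cdot (\rho(X) - p_{t+1}) &= \nabla C^*(p_{t+1}) \cdot (\rho(X) - p_{t+1}), \quad \text{and} \\
s_t \cdot (\rho(X) - p_t) &= \nabla C^*(p_t) \cdot (\rho(X) - p_t).
\end{aligned} \tag{8}
\]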

Finally, applying (8) and adding $C^*(\rho(X)) - C^*(\rho(X))$ to (5) yields

$$\mathrm{Profit}(s_t, s_{t+1}; X) = D_{C^*}(\rho(X), p_t) - D_{C^*}(\rho(X), p_{t+1}). \tag{9}$$

Note that if we had instead added $C^*(\bar{p}) - C^*(\bar{p})$, where $\bar{p} = \mathbb{E}_{X \sim P}[\rho(X)]$, we would see
that the expected ex-post profit under $P$ is $D_{C^*}(\mathbb{E}[\rho(X)], p_t) - D_{C^*}(\mathbb{E}[\rho(X)], p_{t+1})$. We now
have

$$\mathrm{Profit}(s_t, s_{t+1}; X) = L(\nabla C(s_t); X) - L(\nabla C(s_{t+1}); X),$$

since the $f(X)$ terms cancel; the implementation property thus follows with the surjective
map $\varphi : s \mapsto \nabla C(s)$. □
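
As a numerical sanity check of identity (9), the sketch below uses the logarithmic market scoring rule (LMSR), a standard cost function that we assume here for illustration: $C(s) = \log \sum_i e^{s_i}$, with indicator payoffs $\rho(X) = e_X$. Its conjugate is negative entropy on the simplex, so $D_{C^*}(e_X, p)$ reduces to $-\log p_X$. The indicator payoffs lie on the boundary of the price set rather than in its relative interior, so this should be read as a limiting illustration and not a literal instance of the lemma's assumptions. All function names are hypothetical.

```python
import math

def lmsr_cost(s):
    """LMSR cost C(s) = log(sum_i exp(s_i)), computed stably."""
    m = max(s)
    return m + math.log(sum(math.exp(x - m) for x in s))

def lmsr_prices(s):
    """Instantaneous prices p = grad C(s), i.e. the softmax of s."""
    c = lmsr_cost(s)
    return [math.exp(x - c) for x in s]

def profit(s_old, s_new, outcome):
    """Ex-post profit (s_new - s_old) . rho(X) - C(s_new) + C(s_old), rho(X) = e_outcome."""
    return (s_new[outcome] - s_old[outcome]) - lmsr_cost(s_new) + lmsr_cost(s_old)

def divergence_to_outcome(outcome, p):
    """D_{C*}(e_outcome, p): the KL divergence of a point mass from p, equal to -log p[outcome]."""
    return -math.log(p[outcome])

def divergence_drop(s_old, s_new, outcome):
    """Right-hand side of (9): D(rho(X), p_t) - D(rho(X), p_{t+1})."""
    before = divergence_to_outcome(outcome, lmsr_prices(s_old))
    after = divergence_to_outcome(outcome, lmsr_prices(s_new))
    return before - after
```

For any pair of share vectors and any outcome, `profit` and `divergence_drop` should agree up to floating-point error, which is exactly the statement that a trader is paid the log-score improvement of the prices she leaves behind.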

There is another more subtle benefit to APMMs - and, in fact, to most prediction market 
mechanisms in practice - which is that participants make bets via purchasing of shares or 
share bundles. When a trader makes a bet, she purchases a contract bundle $r$, is charged
$C(s + r) - C(s)$ (when the current quantity vector is $s$), and shall receive payout $\rho(X) \cdot r$
if and when $X$ is realized. But at any point before $X$ is observed and trading is open, the
trader can sell off this bundle, to the APMM or another trader, and hence neutralize her
risk. In this sense bets made in an APMM are stateless, whereas for an arbitrary CLM this
may not be the case: the wager defined by $(w_t \mapsto w_{t+1})$ can not necessarily be sold back to
the mechanism, as the posted hypothesis may no longer remain at $w_{t+1}$.
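
The share-based property can be illustrated with a small round trip: a trader buys a bundle $r$, another trade moves the market, and the trader then sells $r$ back. The sketch assumes the LMSR cost $C(s) = \log \sum_i e^{s_i}$ purely as an example; `charge` and `round_trip` are hypothetical names.

```python
import math

def lmsr_cost(s):
    """LMSR cost C(s) = log(sum_i exp(s_i)) (an assumed example cost function)."""
    m = max(s)
    return m + math.log(sum(math.exp(x - m) for x in s))

def charge(s, r):
    """Amount charged for bundle r at quantity vector s: C(s + r) - C(s).

    A negative value means the trader receives cash (she is selling).
    """
    moved = [a + b for a, b in zip(s, r)]
    return lmsr_cost(moved) - lmsr_cost(s)

def round_trip(s, r, other_trade):
    """Buy r at state s, let another trader move the market, then sell r back.

    Returns (net_cash, net_holdings). Zero net holdings mean the trader's
    final payout rho(X) . holdings is zero for every outcome X.
    """
    paid = charge(s, r)
    s_after_buy = [a + b for a, b in zip(s, r)]
    s_later = [a + b for a, b in zip(s_after_buy, other_trade)]
    sell = [-x for x in r]
    received = -charge(s_later, sell)
    holdings = [a + b for a, b in zip(r, sell)]
    return received - paid, holdings
```

After the round trip the trader holds no shares, so her exposure to $X$ is gone regardless of how the market moved in between; only a realized cash gain or loss remains.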

Given a learning problem defined by the GSR $L : \mathcal{H} \times \mathcal{O} \to \mathbb{R}$, it is natural to ask
whether we can design a CLM which implements this L and has this "share-based property" 
of APMMs. More precisely, under what conditions is it possible to implement L with an 
APMM? 

Theorem 1. For any divergence-based GSR $L(w; X) = D_R(\rho(X), \psi(w)) + f(X)$, with
$\psi : \mathcal{H} \to \mathcal{H}'$ one-to-one, $\mathcal{H}' = \mathrm{relint}(\mathcal{H}')$, and $\rho(\mathcal{O}) \subseteq \psi(\mathcal{H})$, there exists an APMM which
implements $L$.

Proof. Recall the functions involved from Definition 3: $\rho : \mathcal{O} \to \mathcal{H}'$, $f : \mathcal{O} \to \mathbb{R}$, $\psi : \mathcal{H} \to \mathcal{H}'$,
and finally $R : \mathcal{H}' \to \mathbb{R}$ is a closed strictly convex function. In order to construct an APMM
we set $C = R^*$, the conjugate dual of $R$, and $\mathcal{S} = \mathrm{dom}(R^*)$. Note that $\mathcal{S} = \mathbb{R}^n$ by assumption.
Using standard results of conjugate duality we have $\nabla C(\mathcal{S}) = \mathrm{dom}(R) = \mathcal{H}'$, and since $R$ is
closed and convex, we also have $C^* = (R^*)^* = R$.

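Since we have $\rho(\mathcal{O}) \subseteq \mathcal{H}' = \nabla C(\mathcal{S})$, we can construct an APMM $A = (\mathcal{S}, \mathcal{O}, \rho, C)$.
By Lemma 1 we can write the profit of the bet $(s_t \mapsto s_{t+1})$ in $A$ as

$$\mathrm{Profit}(s_t, s_{t+1}; X) = D_R(\rho(X), \nabla C(s_t)) - D_R(\rho(X), \nabla C(s_{t+1})). \tag{10}$$

As $\psi$ is one-to-one by assumption, we can now write

$$\mathrm{Profit}(s_t, s_{t+1}; X) = L(\psi^{-1}(\nabla C(s_t)); X) - L(\psi^{-1}(\nabla C(s_{t+1})); X),$$

since the $f(X)$ terms cancel. Finally, since $\nabla C : \mathcal{S} \to \mathcal{H}'$ and $\psi^{-1} : \mathcal{H}' \to \mathcal{H}$ are surjective,
we see that $A$ implements $L$ with map $\varphi = \psi^{-1} \circ \nabla C$. □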
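
The construction in this proof can be checked on the simplest divergence we could think of, which we assume here only for illustration: $R(p) = \frac{1}{2}\|p\|_2^2$ on $\mathcal{H}' = \mathbb{R}^n$ with $\psi$ the identity. Then $C = R^* = \frac{1}{2}\|s\|_2^2$, $\nabla C(s) = s$, and $D_R(x, y) = \frac{1}{2}\|x - y\|_2^2$, so the market's profit should equal the drop in squared-error loss. The function names and the `offset` parameter (standing in for $f(X)$) are hypothetical.

```python
def conjugate_cost(s):
    """C = R* for R(p) = 0.5 * ||p||^2, which is again 0.5 * ||s||^2."""
    return 0.5 * sum(x * x for x in s)

def bregman_r(x, y):
    """D_R(x, y) for R(p) = 0.5 * ||p||^2, i.e. 0.5 * ||x - y||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))

def market_profit(s_old, s_new, payoff):
    """APMM profit (s_new - s_old) . rho(X) - C(s_new) + C(s_old)."""
    linear = sum(p * (b - a) for p, a, b in zip(payoff, s_old, s_new))
    return linear - conjugate_cost(s_new) + conjugate_cost(s_old)

def loss(w, payoff, offset=0.0):
    """Divergence-based GSR L(w; X) = D_R(rho(X), w) + f(X), with f(X) = offset."""
    return bregman_r(payoff, w) + offset
```

Because $\nabla C$ is the identity here, the share vector and the hypothesis coincide, and `market_profit(s_old, s_new, payoff)` should match `loss(s_old, payoff) - loss(s_new, payoff)` for any choice of `offset`.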

We point out that if an APMM implements some arbitrary $L$, not necessarily the canonical
GSR established in Lemma 1, then $L$ is effectively equivalent to that canonical GSR and,
hence, is divergence-based. This fully specifies the class of problems solvable using APMMs.

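Theorem 2. If APMM $(\mathcal{S}, \mathcal{O}, \rho, C)$ implements a GSR $L : \mathcal{H} \times \mathcal{O} \to \mathbb{R}$, then $L$ is divergence-
based.

Proof. Given that $(\mathcal{S}, \mathcal{O}, \rho, C)$ implements $L$, we know there exists some surjective map
$\varphi : \mathcal{S} \to \mathcal{H}$ so that $\mathrm{Profit}(s, s'; X) = L(\varphi(s); X) - L(\varphi(s'); X)$ for all $s, s' \in \mathcal{S}$. Of course,
applying Lemma 1 we now have,

$$D_R(\rho(X), \nabla C(s)) - D_R(\rho(X), \nabla C(s')) = \mathrm{Profit}(s, s'; X) = L(\varphi(s); X) - L(\varphi(s'); X),$$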

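where $R = C^*$. Focusing on $L(\cdot\,; \cdot)$ as a function of $s$, and fixing $s'$ arbitrarily, we see that

$$\begin{aligned}
L(\varphi(s); X) &= D_R(\rho(X), \nabla C(s)) + \big(L(\varphi(s'); X) - D_R(\rho(X), \nabla C(s'))\big) \\
&= D_R(\rho(X), \nabla C(s)) + f(X)
\end{aligned}$$

for some $f : \mathcal{O} \to \mathbb{R}$, since $L(\varphi(s); X)$ cannot depend on the unbound variable $s'$. Further-
more, for any $w, s$ such that $w = \varphi(s)$, we must have

$$L(w; X) = D_R(\rho(X), \nabla C(s)) + f(X).$$

Now since $\varphi$ is surjective, there exists a right inverse $\varphi^{-1}$ with $\varphi \circ \varphi^{-1} = \mathrm{id}_{\mathcal{H}}$. Hence,

$$L(w; X) = D_R(\rho(X), \psi(w)) + f(X),$$

where $\psi = \nabla C \circ \varphi^{-1}$. □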

Theorem 2 establishes a strong connection between prediction markets and a natural class
of GSRs. One interpretation of this result is that any GSR based on a Bregman divergence 
has a "dual" characterization as a share-based market, where participants buy and sell shares 
rather than directly altering the share prices (the hypothesis). This has many advantages 
for prediction markets, not least of which is that shares are often easier to think about than 
the underlying hypothesis space. 

Our notion of a CLM offers another interpretation. In light of Proposition 3, any machine
learning problem whose hypotheses can be evaluated in terms of a divergence leads to a
tractable crowdsourcing mechanism, as was the case in Section 3. Moreover, this theorem
does not preclude efficient yet non-divergence-based loss functions as we see in the next 
section. 

5 Example CLMs for Typical Machine Learning Tasks 

Regression. We now construct a CLM for a typical regression problem. We let $\mathcal{H}$ be the
$\ell_2$-norm ball of radius 1 in $\mathbb{R}^d$, and we shall let an outcome be a batch of data, that is
$X := \{(x_1, y_1), \ldots, (x_n, y_n)\}$ where for each $i$ we have $x_i \in \mathbb{R}^d$, $y_i \in [-1, 1]$, and we assume
$\|x_i\|_2 \le 1$. We construct a GSR according to the mean squared error,
$L(w; \{(x_i, y_i)\}_{i=1}^n) = \frac{\alpha}{2n} \sum_{i=1}^n (w \cdot x_i - y_i)^2$ for some parameter $\alpha > 0$. It is worth noting that $L$ is not divergence-
based.

In order to satisfy the escrow property (ES), we can set $\mathrm{Cost}(w, w') := 2\alpha \|w - w'\|_2$
because the function $L(w; X)$ is $2\alpha$-Lipschitz with respect to $w$ for any $X$. To ensure that the
CLM is $L$-incentivized, we must set $\mathrm{Payout}(w, w'; X) := \mathrm{Cost}(w, w') + L(w; X) - L(w'; X)$.

If we set the initial hypothesis $w_0 = 0$, it is easy to check that WorstCaseLoss $= \alpha/2$,
since $L(0; X) = \frac{\alpha}{2n} \sum_{i=1}^n y_i^2 \le \alpha/2$. It
remains to check whether this CLM is tractable. It's clear that we can efficiently compute
Cost and Payout, hence the EC property holds. Given how Cost is defined, it is clear that
the set $\{w' : \mathrm{Cost}(w, w') \le B\}$ is just an $\ell_2$-norm ball. Also, since $L$ is convex in $w$ for each
$X$, the function $\mathbb{E}_{X \sim P}[\mathrm{Profit}(w, w'; X)]$ is concave in $w'$ for every $P$. A budget-constrained profit-
maximizing participant must simply solve a convex optimization problem, and hence the TT
property holds.
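
The regression CLM above can be sketched in a few lines. The formulas for the loss, Cost, and Payout follow the definitions in this section; the function names and the list-based data layout are our own assumptions for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def loss(w, batch, alpha):
    """GSR L(w; X) = alpha / (2n) * sum_i (w . x_i - y_i)^2 over a batch of (x_i, y_i)."""
    n = len(batch)
    return alpha / (2.0 * n) * sum((dot(w, x) - y) ** 2 for x, y in batch)

def cost(w, w_new, alpha):
    """Cost(w, w') = 2 * alpha * ||w - w'||_2, the amount held in escrow for the update."""
    return 2.0 * alpha * math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_new)))

def payout(w, w_new, batch, alpha):
    """Payout(w, w'; X) = Cost(w, w') + L(w; X) - L(w'; X)."""
    return cost(w, w_new, alpha) + loss(w, batch, alpha) - loss(w_new, batch, alpha)

def profit(w, w_new, batch, alpha):
    """Net profit Payout - Cost, which equals the loss improvement L(w; X) - L(w'; X)."""
    return payout(w, w_new, batch, alpha) - cost(w, w_new, alpha)
```

Because $L$ is $2\alpha$-Lipschitz in $w$ on the stated domain, the loss can worsen by at most the escrowed cost, so `payout` should never be negative for hypotheses in the unit ball and data satisfying the stated bounds.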

Betting Directly on the Labels. Let us return our attention to the Netflix Prize model 
as discussed in the Introduction. For this style of competition a host releases a dataset 
for a given prediction task. The host then requests participants to provide predictions on 
a specified set of instances on which it has correct labels. For every submission the agent 
computes an error measure, say the MSE, and reports this to the participants. Of course, 
the correct labels are withheld throughout. 

Our CLM framework is general enough to apply to this problem framework as well. Define
$\mathcal{H} = \mathcal{O} = K^m$, where $K \subset \mathbb{R}$ is the bounded set of valid labels and $m$ is the number of
requested test set predictions. For some $w \in \mathcal{H}$ and $y \in \mathcal{O}$, $w(k)$ specifies the $k$th predicted
label, and $y(k)$ specifies the true label. A natural scoring function is the total squared loss,
$L(w; y) := \sum_{k=1}^{m} (w(k) - y(k))^2$. Of course, this approach is quite different from the Netflix
Prize model, in two key respects: (a) the participants have to wager on their predictions and
(b) by participating in the mechanism they are required to reveal their modification to all
of the other players. Hence while we have structured a competitive process the participants
are de facto forced to collaborate on the solution.

A reasonable critique of this collaborative mechanism approach to a Netflix-style competition
is that it does not provide the instant feedback of the "leaderboard" where individuals
observe performance improvements in real time. However, we can adjust our mechanism to 
be online with a very simple modification of the CLM protocol, which we sketch here. Rather 
than make payouts in a large batch at the end, the competition designer could perform a 
mini-payout at the end of each of a sequence of time intervals. At each interval, the designer 
could select a (potentially random) subset $S$ of user/movie pairs in the remaining test set,
freeze updates on the predictions $w(k)$ for all $k \in S$, and perform payouts to the participants
on only these labels. What makes this possible, of course, is that the generalized scoring 
rule we chose decomposes as a sum over the individual labels. 
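
The decomposition that makes mini-payouts possible can be seen in a short sketch. It considers a single update $w \mapsto w'$ and splits the test indices into disjoint intervals; in the full online protocol, later intervals would also reflect updates made after earlier labels were frozen, which this simplified illustration ignores. All names are hypothetical.

```python
def squared_loss(w, y, indices):
    """Total squared loss restricted to the given label indices."""
    return sum((w[k] - y[k]) ** 2 for k in indices)

def profit_on(w_old, w_new, y, indices):
    """Profit of the update w_old -> w_new, scored only on the given indices."""
    return squared_loss(w_old, y, indices) - squared_loss(w_new, y, indices)

def mini_payout_profits(w_old, w_new, y, intervals):
    """Profit paid out at each interval, where intervals partition the test indices."""
    return [profit_on(w_old, w_new, y, subset) for subset in intervals]
```

Since the loss is a sum over labels, the per-interval profits add up to the profit the participant would have earned from a single batch payout at the end.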

With this online approach just discussed, let us end with one final observation. Given 
that some firm such as Netflix would like to make good predictions on a given data source, the 
firm could potentially rely entirely on the advice from a CLM. The firm could post batches 
of data on which it does not have labels and ask the participants to provide predictions via 
the CLM. The firm could then pay out according to a small sample that is manually
labeled or whose labels are received in the meantime. But by not revealing on which subset
the labels will arrive, the firm receives potentially good predictions on the full set. This 
could provide a valuable market between small firms which have machine learning needs and 
"freelance" machine learning bounty hunters. 